Parallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly structured communication workloads from messages propagating along graph edges. We can statically expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks with up to 2025 parallel elements for the BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean), and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.